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Abstract 

We pose causal inference as the problem of lear¬ 
ning to classify probability distributions. In 
particular, we assume access to a collection 
{(•Si, ^i)}r=i’ where each Si is a sample drawn 
from the probability distribution of XiXYi, and k 
is a binary label indicating whether “Xi —)■ Yi” or 
“Xi <— Yi’. Given these data, we build a causal 
inference rule in two steps. First, we featurize 
each Si using the kernel mean embedding asso¬ 
ciated with some characteristic kernel. Second, 
we train a binary classifier on such embeddings to 
distinguish between causal directions. We present 
generalization bounds showing the statistical con¬ 
sistency and learning rates of the proposed ap¬ 
proach, and provide a simple implementation that 
achieves state-of-the-art cause-effect inference. 
Furthermore, we extend our ideas to infer causal 
relationships between more than two variables. 


1. Introduction 

The vast majority of statistical learning algorithms rely on 
the exploitation of associations between the variables un¬ 
der study. Given the argument that all associations arise 
from underlying causal structures (Reichenbach, 1956), and 
that different structures imply different influences between 
variables, the question of how to infer and use causal knowl¬ 
edge in learning acquires great importance (Pearl, 2000; 
Scholkopf et ah, 2012). Traditionally, the most widely 
used strategy to infer the causal structure of a system is 
to perform interventions on some of its variables, while 
studying the response of some others. However, such in¬ 
terventions are in many situations unethical, expensive, or 
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even impossible to realize. Consequently, we often face the 
need of causal inference purely from observational data. 
In these scenarios, one suffers, in the absence of strong 
assumptions, from the indistinguishability between latent 
confounding {X ^ Z ^ Y) and direct causation {X Y 
ov X p- Y). Nevertheless, disregarding the impossibility 
of the task, humans continuously learn from experience to 
accurately infer causality-revealing patterns. Inspired by 
this successful learning, and in contrast to prior work, this 
paper addresses causal inference by unveiling such causal 
patterns directly from data. In particular, we assume access 
to a set {{Si, (i)}?=i’ where each Si is a sample set drawn 
from the probability distribution of Xi x Yi, and li is a 
binary label indicating whether “Xi Yi” or “Xi p- Y”. 
Using these data, we build a causal inference rule in two 
steps. First, we construct a suitable and nonparametric rep¬ 
resentation of each sample Si. Second, we train a nonlinear 
binary classifier on such features to distinguish between 
causal directions. Building upon this framework, we derive 
theoretical guarantees regarding consistency and learning 
rates, extend inference to the multivariate case, propose 
approximations to scale learning to big data, and obtain 
state-of-the-art performance with a simple implementation. 

Given the ubiquity of uncertainty in data, which may arise 
from noisy measurements or the existence of unobserved 
common causes, we adopt the probabilistic interpretation of 
causation from Pearl (2000). Under this interpretation, the 
causal structure underlying a set of random variables X = 
{Xi,..., Xfi), with joint distribution P, is often described 
in terms of a Directed Acyclic Graph (DAG), denoted by 
G = (U, E). In this graph, each vertex Vi G V is associated 
to the random variable Xi G X, and an edge Eji G E 
from Vj to Vi denotes the causal relationship “Xi G- X/’. 
More specifically, these causal relationships are defined by 
a structural equation model, each Xi G- /i(Pa(Xi), Ni), 
where fi is a function, Pa(Ai) is the parental set ofVi gV, 
and Ni is some independent noise variable. Then, causal 
inference is the task of recovering G from S ^ P”. 
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1.1. Prior Art 

We now briefly review the state-of-the-art on the inference 
of causal structures G from observational data S ~ P”. For 
a more thorough exposition, see, e.g., Mooij et al. (2014). 

One of the main strategies to recover G is through the ex¬ 
ploration of conditional dependencies, together with some 
other technical assumptions such as the Markov wd faith¬ 
fulness relationships between P and G (Pearl, 2000). This 
is the case of the PC algorithm (Spirtes et al., 2000), which 
allows the recovery of the Markov equivalence class of G 
without placing any restrictions on the structural equation 
model specifying the random variables under study. 

Causal inference algorithms that exploit conditional depen¬ 
dencies are unsuited for inference in the bivariate case. Con¬ 
sequently, a large body of work has been dedicated to the 
study of this scenario. First, the linear non-Gaussian causal 
model (Shimizu et al., 2006; 2011) recovers the true causal 
direction between two variables whenever their relationship 
is linear and polluted with additive and non-Gaussian noise. 
This model was later extended into nonlinear additive noise 
models (Hoyer et al., 2009; Zhang & Hyvarinen, 2009; Ste- 
gle et al., 2010; Kpotufe et al., 2013; Peters et al., 2014), 
which prefer the causal direction under which the alleged 
cause is independent from the additive residuals of some 
nonlinear fit to the alleged effect. Third, the information ge¬ 
ometric causal inference framework (Daniusis et al., 2012; 
Janzing et al., 2014) assumes that the cause random vari¬ 
able is independently generated from some invertible and 
deterministic mapping to its effect; thus, it is unlikely to 
And dependencies between the density of the former and the 
slope of the latter, under the correct causal direction. 

As it may be inferred from the previous exposition, there 
exists a large and heterogeneous array of causal inference 
algorithms, each of them working under a very specialized 
set of assumptions, which are sometimes difficult to test in 
practice. Therefore, there exists the need for a more flexible 
causal inference rule, capable of learning the relevant causal 
footprints, later used for inference, directly from data. Such 
a “data driven” approach would allow to deal with complex 
data-generating processes, and would greatly reduce the 
need of explicitly crafting identiflability conditions a-priori. 

A preliminary step in this direction distilled from the com¬ 
petitions organized by Guyon (2013; 2014), which phrased 
causal inference as a learning problem. In these compe¬ 
titions, the participants were provided with a large col¬ 
lection of cause-effect samples {{Si, li)}f^i, where Si = 
{ixij,yij)}{l'Si is drawn from the probability distribution 
of Xi X Yi, and f is a binary label indicating whether 
“Xi Yi” or “Yi —> Xi”. Given these data, most partici¬ 
pants adopted the strategy of i) crafting a vector of features 
from each Si, and ii) training a binary classifier on top of 


the constructed features and paired labels. Although these 
“data-driven” methods achieved state-of-the-art performance 
(Guyon, 2013), the laborious task of hand-crafting features 
renders their theoretical analysis impossible. 

In more specific terms, the approach described above is a 
learning problem with inputs being sample sets Si, where 
each Si contains samples drawn from the probability dis¬ 
tribution Pi{Xi, Yi). In a separate strand of research, there 
has been several attempts to learn from probability distri¬ 
butions in a principled manner (Jebara et al., 2004; Hein & 
Bousquet, 2004; Cuturi et al., 2005; Martins et al., 2009; 
Muandet et al., 2012). Szabo et al. (2014) presented the 
first theoretical analysis of distributional learning based on 
kernel mean embedding (Smola et al., 2007), with focus on 
kernel ridge regression. Similarly, Muandet et al. (2012) 
studied the problem of classifying distributions, but their ap¬ 
proach is constrained to kernel machines, and no guarantees 
regarding consistency or learning rates are provided. 

1.2. Our Contribution 

Inspired by Guyon’s competitions, we pose causal infer¬ 
ence as the problem of classifying probability measures on 
causally related pairs of random variables. Our contribution 
to this framework is the use of kernel mean embeddings 
to nonparametrically featurize each cause-effect sample Si. 
The benefits of this approach are three-fold. First, this 
avoids the need of hand-engineering features from the sam¬ 
ples Si. Second, this enables a clean theoretical analysis, 
including provable learning rates and consistency results. 
Third, the kernel hyperparameters (that is, the data repre¬ 
sentation) can be jointly optimized with the classifier using 
cross-validation. Furthermore, we show how to extend these 
ideas to infer causal relationships between d>2 variables, 
give theoretically sustained approximations to scale lear¬ 
ning to big data, and provide the source code of a simple 
implementation that outperforms the state-of-the-art. 

The rest of this article is organized as follows. Section 2 
reviews the concept of kernel mean embeddings, the tool 
that will facilitate learning from distributions. Section 3 
shows the consistency and learning rates of our kernel mean 
embedding classification approach to cause-effect inference. 
Section 4 extends the presented ideas to the multivariate 
causal inference case. Section 5 presents a variety of ex¬ 
periments displaying the state-of-the-art performance of a 
simple implementation of the proposed framework. For 
convenience. Table 1 summarizes our notations. 

2. Kernel Mean Embeddings of Probability 
Measures 

In order to later classify probability measures P according 
to their causal properties, we first need to featurize them into 
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E[ej,V[ej 

Expected value and variance of r.v. ^ 

Z 

Domain of cause-effect pairs Z = (X, Y) 

V 

Set of cause-effect measures P on Z 

c 

Set of labels U G { — 1,1} 


Mother distribution over V x C 

{{PHm=i 

Sample from M" 

Si={z,,y;U 

Sample from Pf' 

PSi 

Empirical distribution of Si 

k 

Kernel function from Z x Z toM. 

Hk 

RKHS induced by k 

hk{P) 

Kernel mean embedding of measure P gV 

f^k {Psi ) 

Empirical mean embedding of Ps^ 

hk{V) 

The set |pfc(P) ■■ P GV\ 

Mfe 

Measure over tik{P) x C induced by M 

Pk 

Class of functionals mapping Hk to M. 

Rn{Pk) 

Rademacher complexity of class Pk 

R<p{f) 

Cost and surrogate 99 -risk of sign of 


Table 1. Table of notations 


a suitable representation. To this end, we will rely on the 
concept of kernel mean embeddings (Berlinet & Thomas- 
Agnan, 2004; Smola et al., 2007). 

In particular, let P be the probability distribution of some 
random variable Z taking values in the separable topological 
space {Z, Tz). Then, the kernel mean embedding of P asso¬ 
ciated with the continuous, bounded, and positive-dehnite 
kernel function fc : Z x Z —)■ K is 

IJ-kiP) ■= J k{z,-)dP{z), (1) 

which is an element inHk, the Reproducing Kernel Hilbert 
Space (RKHS) associated with k (Scholkopf & Smola, 
2002). Interestingly, the mapping is injective if k is 
a characteristic kernel (Sriperumbudur et ah, 2010), that is, 
||pfc(P) - HkiQ)\\'Hk = 0 ^ P ^ Q. Said differently, if 
using a characteristic kernel, we do not lose any information 
when embedding distributions. An example of characteristic 
kernel is the Gaussian kernel 

k{z, z') = exp (- 7 ||z - z'Wl) , 7 > 0 , ( 2 ) 

which will be used throughout this paper. 

In many practical simations, it is unrealistic to assume ac¬ 
cess to the true distribution P, and consequently to the true 
embedding gLk{P). Instead, we often have access to a sam¬ 
ple S = {-ZijjLi which can be used to construct 

the empirical distribution P 5 := ^ "'here S(^z) 

is the Dirac distribution centered at 0 . Using Ps, we can 
approximate ( 1 ) by the empirical kernel mean embedding 

1 ” 

Pk{Ps)-=-y^k{z,,-)enk. ( 3 ) 

n ^' 

i=l 

The following result is a slight modihcation of Theorem 27 
from (Song, 2008). It establishes the convergence of the 


empirical mean embedding p.k{Ps) to the embedding of its 
population counterpart p,k{P), in RKHS norm: 

Theorem 1. Assume that ||/||oo < 1 for all f G Hk with 
ll/ll«fc < 1. Then with probability at least 1 — (5 we have 

\\lik{P) - Ttk{Ps)W < 2 

V n y n 

Proof. See Section B.l. □ 

3 . A Theory of Causal Inference as 
Distribution Classification 

This section phrases the inference of cause-effect relation¬ 
ships from probability measures as the classihcation of em¬ 
pirical kernel mean embeddings, and analyzes the learning 
rates and consistency of such approach. Throughout our 
exposition, the setup is as follows: 

1. We assume the existence of some Mother distribution 
M, defined onV x C, where V is the set of all Borel 
probability measures on the space Z of two causally 
related random variables, and C = {—1,-|-1}. 

2. Aset {{Pi, h)}7-i sampled from M”. Each measure 
Pi G V is the joint distribution of the causally related 
random variables Zi = {Xi, Tj), and the label U G C 
indicates whether “Xi Y" or “Xi G- Y" ■ 

3. In practice, we do not have access to the mea¬ 
sures {Pijf^i- Instead, we observe samples Si = 
{{xij,yij)}]Li ~ Pr^’ for all 1 < i < n. Using Si, 
the data {{Si, li)}2=i is provided to the learner. 

4. We featurize every sample Si into the empirical kernel 
mean embedding pk {PSi) associated with some kernel 
function k (Equation 3). If fc is a characteristic kernel, 
we incur no loss of information in this step. 

Under this setup, we will use the set {{p,k{PSi)t h)}i'=i to 
train a binary classifier from Hi,, to C, which will later be 
used to unveil the causal directions of new, unseen probabil¬ 
ity measures drawn from M. Note that this framework can 
be straightforwardly extended to also infer the “confounding 
(X •(— Z —> U)” and “independent (X JL Y)” cases by 
adding two extra labels to C. 

Given the two nested levels of sampling (being the first 
one from the Mother distribution M, and the second one 
from each of the drawn cause-effect measures Pi), it is 
not trivial to conclude whether this learning procedure is 
consistent, or how its learning rates depend on the sample 
sizes n and In the following, we will study the 

generalization performance of empirical risk minimization 
over this learning setup. Specifically, we are interested in 
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upper bounding the excess risk between the empirical risk 
minimizer and the best classiher from our hypothesis class, 
with respect to the Mother distribution M. 

We divide our analysis in three parts. First, §3.1 reviews the 
abstract setting of statistical learning theory and surrogate 
risk minimization. Second, §3.2 adapts these standard re¬ 
sults to the case of empirical kernel mean embedding clas- 
sihcation. Third, §3.3 considers theoretically sustained ap¬ 
proximations to deal with big data. 

3.1. Margin-based Risk Bounds in Learning Theory 

Let P be some unknown probability measure defined on 
Z X £, where Z is referred to as the input space, and 
£ = {—1,1} is referred to as the output space^. One of the 
main goals of statistical learning theory (Vapnik, 1998) is to 
find a classiher h: Z —>■ C that minimizes the expected risk 

for a suitable loss function i: CxC ^ K’*', which penalizes 
departures between predictions h{z) and true labels 1. For 
classihcation, one common choice of loss function is the 0-1 
loss £oi{l,l') = for which the expected risk measures 

the probability of misclassihcation. Since P is unknown in 
natural situations, one usually resorts to the minimization 
of the empirical risk ^ X]r=i ^i) some hxed 

hypothesis class H, for the training set {{zi, li)}2=i ^ IP"- 
It is well known that this procedure is consistent under 
mild assumptions (Boucheron et ak, 2005). 

Unfortunately, the 0-1 loss function is not convex, which 
leads to empirical risk minimization being generally in¬ 
tractable. Instead, we will focus on the minimization of 
surrogate risk functions (Bartlett et ak, 2006). In partic¬ 
ular, we will consider the set of classihers of the form 
H = {sign of: f g where IF is some hxed set of real¬ 
valued functions /: Z —> M. Introduce a nonnegative cost 
function (/?:!&—>■ IR+ which is surrogate to the 0-1 loss, that 
is, (p{e) > le>o- For any f G J- we dehne its expected and 
empirical tp-risks respectively as 


RAf) = , P M-fAi)], 

(4) 

1 "" 

RAf) = - 

^ 2 = 1 

(5) 


Many natural choices of ip lead to tractable empirical risk 
minimization. Common examples of cost functions include 
the hinge loss (p{e) = max(0,1 + e) used in SVM, the 
exponential loss ip{e) = exp(e) used in Adaboost, and the 
logistic loss ip{e) = log 2 (l + used in logistic regression. 

'Refer to Section A for considerations on measurability. 


The misclassihcation error of sign o / is always upper 
bounded by Rip{f). The relationship between functions 
minimizing R^p{f) and functions minimizing i?(signo/) 
has been intensively studied in the literature (Steinwart & 
Christmann, 2008, Chapter 3). Given the high uncertainty 
associated with causal inferences, we argue that one is inter¬ 
ested in predicting soft probabilities rather than hard labels, 
a fact that makes the study of margin-based classihers well 
suited for our problem. 

We now focus on the estimation of f* G T, the function 
minimizing (4). However, since the distribution P is un¬ 
known, we can only hope to estimate /„ G T, the func¬ 
tion minimizing (5). Therefore, we are interested in high- 
probability upper bounds on the excess ip-risk 

£Afn)=RMn)-RM*)^ ( 6 ) 

w.r.t. the random training sample {(zj, ~ P"- The 

excess risk (6) can be upper bounded in the following way: 

£Afn) < RMn) - RMn) + MH " R^f) 

<2snp\RAf)-RM)\- (7) 

While this upper bound leads to tight results for worst case 
analysis, it is well known (Bartlett et ak, 2005; Boucheron 
et ak, 2005; Koltchinskii, 2011) that tighter bounds can be 
achieved under additional assumptions on P. However, we 
leave these analyses for future research. 

The following result — in spirit of Koltchinskii & 
Panchenko (1999); Bartlett & Mendelson (2002) — can 
be found in Boucheron et ak (2005, Theorem 4.1). 

Theorem 2. Consider a class T of functions mapping Z 
to M. Let K —>■ IR+ be a L^-Lipschitz function such 
that ip(e) > le>o- Let B be a uniform upper bound on 
ip{-f{e)l). Let {{zi,li)A^i ^ P and {cri}f^i be i.i.d. 
Rademacher random signs. Then, with prob. at least 1 — ( 5 , 


sup|i?^(/)-i?^(/)| 
fer 

V 2n ’ 

where the expectation is taken w.r.t. {ai, 

The expectation in the bound of Thm. 2 is known as the 
Rademacher complexity of F, will be denoted by Rn{F), 
and has a typical order of (Koltchinskii, 201 1). 


< 2L^E 


1 

sup — 




i=l 


3.2. From Classic to Distributional Learning Theory 

Note that we can not directly apply the empirical risk min¬ 
imization bounds discussed in the previous section to our 
learning setup. This is because instead of learning a classi¬ 
fier on the i.i.d. sample {p,k{Pi), we have to learn 
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over the set where Si ~ Said differ¬ 

ently, our input feature vectors Hk{PSi) are “noisy”: they 
exhibit an additional source of variation as any two different 
random samples ^ do. In the following, we 

study how to incorporate these nested sampling effects into 
an argument similar to Theorem 2. 

We will now frame our problem within the abstract learning 
setting considered in the previous section. Recall that our 
learning setup initially considers some Mother distribution 
M over V x C. Let /ifc(P) = {/rfc(P) : P G P} C 
T-Lk, C = {—1, +1}, and be a measure (guaranteed to 
exist by Lemma 2, Section A.l) on Hk{P) x P induced 
by M. Specifically, we will consider HkiP) Q P/c and C 
to be the input and output spaces of our learning problem, 
respectively. Let { [^k{Pi), h) ~ training 

set. We will now work with the set of classifiers {sign o 
f: f G Pi.) for some fixed class Pj. of functionals mapping 
from the RKHS Hk to K. 

As pointed out in the description of our learning setup, we do 
not have access to the distributions {PijtLi but to samples 
Si ~ P"% for all 1 < z < n. Because of this reason, we 
define the sample-based empirical ip-risk 

n 

Kif) = ^Y.^{-hf{lik{Ps^)). 


distances \\p,k{Pi) — p,k{Psi)\\'Hk^ which are in turn upper 
bounded using Theorem 1. To this end, we will have to 
assume that the class P^ consists of functionals with uni¬ 
formly bounded Lipschitz constants, such as the set of linear 
functionals with uniformly bounded operator norm. 

We now present the main result of this section, which pro¬ 
vides a high-probability bound on the excess risk (9). Im¬ 
portantly, this excess risk will translate into the expected 
causal inference accuracy of our distribution classifier. 

Theorem 3. Consider the RKHS Hk associated with 
some bounded, continuous kernel function k, such that 
suPzgz k(z, z) < 1. Consider a class Tk of function¬ 
als mapping Hk to K with Lipschitz constants uniformly 
bounded by Ljr. Let tp: K —>■ K’*' be a L^-Lipschitz func¬ 
tion such that (j){z) > lz>o- Let (p(^—f{h)tj < B for every 
f G Pfc, h G Hk, cmd I G C. Then, with probability not less 
than 1 — (5 (overall sources of randomness) 

RMn) - RM*) < ^L^RniPk) + 

AL^Ljr ^ / I e ^r.pfk(z^^ / log(2n/(5) \ 
rz ^^ \ y Tii y 277.2 j 

Proof See Section B.2. □ 


which is the approximation to the empirical (p-risk R^{f ) 
that results from substituting the embeddings p.k{Pi) with 
their empirical counterparts p,k{Psi)- 

Our goal is again to find the function f* G Pfe minimizing 
expected (/3-risk R^{f). Since Mfc is unknown to us, and 
we have no access to the embeddings {p,k{Pi)}2=i, we will 
instead use the minimizer of R^{f) in P^: 

/„ G arg min .R^(/). (8) 

To sum up, the excess risk (6) can now be reformulated as 

RMn)-RM*)- ( 9 ) 

Note that the estimation of f* drinks from two nested 
sources of error, which are i) having only n training samples 
from the distribution Mfe, and ii) having only rz/ samples 
from each measure Pi. Using a similar technique to (7), we 
can upper bound the excess risk as 

RMn) - RMl < sup \RM) - RM)\ (10) 

+ swp \R.^{f) - R^{f)\. (11) 

feJ^k 

The term (10) is upper bounded by Theorem 2. On the 
other hand, to deal with (11), we will need to upper bound 
the deviations \f{fJ,k{Pi)) — f {pkiPSi)) \ in terms of the 


As mentioned in Section 3.1, the typical order of Rn{Pk) 
is For a particular examples of classes of func¬ 

tionals with small Rademacher complexity we refer to Mau¬ 
rer (2006). In such cases, the upper bound in Theorem 3 
converges to zero (meaning that our procedure is consis¬ 
tent) as both 77 and rii tend to infinity, in such a way that^ 
logn/rii = o(l). The rate of convergence w.r.t. n can be 
improved up to 0{n~^) if placing additional assumptions 
on M (Bartlett et ah, 2005). On the contrary, the rate w.r.t. 
rzi cannot be improved in general. Namely, the convergence 
rate presented in the upper bound of Theorem 1 

is tight, as shown in the following novel result. 

Theorem 4. Under the assumptions of Theorem 1 denote 

^Hk = sup Y^r.p[f{z)]. 

\\f\\nk<t 


Then there exist universal constants c, C such that for every 
integer n > and with probability at least c 


\\MP)-MPs)\\Hk>c^. 

V Tl 


Proof. See Section B.3. 


□ 


Finally, it is instructive to relate the notion of “identifiability” 
often considered in the causal inference community (Pearl, 

^ We conjecture that this constraint is an artifact from our proof. 
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2000) to the properties of the Mother distribution. Saying 
that the model is identifiable means that the label lof P G V 
is assigned deterministically by M. In this case, learning 
rates can become as fast as 0{n~^). On the other hand, as 
M(Z|P) becomes nondeterministic, the problem becomes 
unidentifiable and learning rates slow down (for example, 
in the extreme case of cause-effect pairs related by linear 
functions polluted with additive Gaussian noise, M(Z = 
-|-1|P) = M(( = —Ilf*) almost surely). The investigation 
of these phenomena is left for future research. 

3.3. Low Dimensional Embeddings for Large Data 

For some kernel functions, the embeddings Hk{Ps) C P-k 
are infinite dimensional. Because of this reason, one must 
resort to the use of dual optimization problems, and in par¬ 
ticular, kernel matrices. The construction of these matrices 
requires at least O(n^) computational and memory require¬ 
ments, prohibitive for large n. In this section, we show 
that the inhnite-dimensional embeddings Hk{Ps) & P-k can 
be approximated with easy to compute, low-dimensional 
representations (Rahimi & Recht, 2007; 2008). This will 
allow us to replace the infinite-dimensional minimization 
problem ( 8 ) with a low-dimensional one. 

Assume that Z = and that the kernel function k is real¬ 
valued, and shift-invariant. Then, we can exploit Bochner’s 
theorem (Rudin, 1962) to show that, for any z,z' G Z: 

k(z,z')= 2CkE [cos((w, z)+ b) cos((w, z')+ b)], (12) 

w,b 

where w ~ b ~ (7[0, 27r], pk- Z —)■ K is the pos¬ 

itive and integrable Fourier transform of k, and Ck = 
J 2 : Pk(w)dw. For example, the squared-exponential kernel 
( 2 ) is shift-invariant, and its evaluations can be approximated 
by (12), if setting pfc(w) = A/’(w|0, 27 /), and Ck = 1. 

We now show that for any probability measure Q on Z and 
z G Z, the function k{z,-) G Pk ^ L 2 {Q) can be ap¬ 
proximated by a linear combination of randomly chosen 
elements from the Hilbert space L 2 {Q). Namely, consider 
the functions parametrised hy w, z G Z and b G [0, 27r]: 

9wA’) = cos((u;, z) + b) cos({w, •) -f 6), (13) 

which belong to L 2 {Q), since they are bounded. If we 
sample bj)}JLi i.i.d., as discussed above, the average 

.. 771 

‘ i=l 

can be viewed as an L 2 (Q)-valued random variable. More¬ 
over, (12) shows that = k{z, •). This enables 

us to invoke concentration inequalities for Hilbert spaces 
(Ledoux & Talagrand, 1991), to show the following result, 
which is in spirit to Rahimi & Recht (2008, Lemma 1). 


Lemma 1. Let Z = For any shift-invariant kernel k, 
s.t. sup^g 2 k{z,z) < 1, any fixed S = C Z, any 

probability distribution Q on Z, and any 5 > 0, we have 

9k{Ps)-lj2m-) 

‘ i=l 

with probability larger than 1 — <5 over {(wi, 

Proof. See Section B.4. □ 

Once sampled, the parameters {{wi,bi)}'^i al¬ 
low us to approximate the empirical kernel mean 
embeddings {Pk{PSi)}^-i using elements from 
span({cos((wi, •) -I- which Is a hnite-dlmensional 

subspace of L 2 {Q). Therefore, we propose to use 
{{Pk,m{PSi) as the training sample for our final 
empirical risk minimization problem, where 

Pk,m{Ps) = 

These feature vectors can be computed in 0{m) time and 
stored in 0(1) memory; importantly, they can be used off- 
the-shelf In conjunction with any learning algorithm. 

For the precise excess risk bounds that take into account the 
use of these low-dimensional approximations, please refer 
to Theorem 6 from Section B.5. 

4. Extensions to Multivariate Causal 
Inference 

It is possible to extend our framework to infer causal relaton- 
ships between d>2 variables X = (2fi,..., Xfi). To this 
end, and as introduced in Section 1, assume the existence 
of a causal directed acyclic graph G which underlies the 
dependencies present in the probability distribution P{X). 
Therefore, our task is to recover G from S ~ P". 

Naively, one could extend the framework presented in Sec¬ 
tion 3 from the binary classification of 2-dimensional dis¬ 
tributions to the multiclass classification of d-dimensional 
distributions. However, the number of possible DAGs (and 
therefore, the number of labels in our multiclass classihca- 
tion problem) grows super-exponentially in d. 

An alternative approach is to consider the probabilities of 
the three labels ^ Xf, “Xi G- Xf‘, and “Xi JL X/‘ 
for each pair of variables {Xi, Xj} C X, when embedded 
along with every possible context Xk Q X \ {Xi,Xj}. 
The intuition here is the same as in the PC algorithm of 
Spirtes et al. (2000); in order to decide the (absence of a) 
causal relationship between Xi and Xj, one must analyze 
the confounding effects of every Xk G X \ {Xi, Xj}. 


T.ofO'l 
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5. Numerical Simulations 

We conduct an array of experiments to test the effectiveness 
of a simple implementation of the presented causal learning 
framework. Given the use of random embeddings (14) in 
our classifier, we term our method the Randomized Cau¬ 
sation Coefficient (RCC). Throughout our simulations, we 
featurize each sample S = {{xi, as 

= {R‘k,m{Ps^)i d'k,m{PSy)i R‘k,m{PS:cy))'’ (15) 

where the three elements forming (15) stand for the low¬ 
dimensional representations (14) of the empirical kernel 
mean embeddings of and {(a;*, ?/i)}”=i, 

respectively. The representation (15) is motivated by the 
typical conjecture in causal inference about the existence 
of asymmetries between the marginal and conditional dis¬ 
tributions of causally-related pairs of random variables 
(Scholkopf et ah, 2012). Each of these three embeddings 
has random features sampled to approximate the sum of 
three Gaussian kernels (2) with hyper-parameters O.I 7 , 7 , 
and IO 7 , where 7 is found using the median heuristic. In 
practice, we set m = 1000 , and observe no significant im¬ 
provements when using larger amounts of random features. 
To classify the embeddings (15) in each of the experiments, 
we use the random forest^ implementation from Python’s 
sklearn-0.16-git. The number of trees is chosen 
from {100, 250, 500,1000,5000} via cross-validation. 

Our experiments can be replicated using the source code at 

https : //github . com/lopezpaz/causation_learning_theory. 

5 . 1 . Classification of Tiibingen Cause-Effect Pairs 

The Tubingen cause-effect pairs is a collection of hetero¬ 
geneous, hand-collected, real-world cause-effect samples 
(Zscheischler, 2014). Given the small size of this dataset, 
we resort to the synthesis of an artificial Mother distribu¬ 
tion to sample our training data from. To this end, as¬ 
sume that sampling a synthetic cause-effect sample set 
Si := {(iij, Ve equals the following simple 

generative process: 

1. A cause vector is sampled from a mixture of 

Gaussians with c components. The mixture weights 
are sampled fromZ//( 0 , 1 ), and normalized to sum to 
one. The mixture means and standard deviations are 
sampled from A/'(0, tri), and A/'(0, (T 2 ), respectively, 
accepting only positive standard deviations. The cause 
vector is standardized to zero mean and unit variance. 

2. A noise vector is sampled from a centered 

^Although random forests do not comply with Lipschitzness 
assumptions from Section 3, they showed the best empirical results. 
Compliant alternatives such as SVMs exhibited a typical drop in 
classification accuracy of 5%. 



Figure 1. Accuracy of RCC, IGCI and ANM on the Tubingen 
cause-effect pairs, as a function of decision rate. The grey area 
depicts accuracies not statistically significant. 

Gaussian, with variance sampled fromZ//(0, 173 ). 

3. A mapping mechanism fi is conceived as a spline 
fitted using an uniform grid of df elements from 

to max((xij)"^;^) as inputs, and df 
normally distributed outputs. 

4. An effect vector is built as := fi{xij) + 
and standardized to zero mean and unit variance. 

5. Return the cause-effect sample Si := {{xij,yij)}^^i. 

To choose a 9 = (c, cti, (72 , ua, d/) that best resembles the 
unlabeled test data, we minimize the distance between the 
embeddings of N synthetic pairs and the Tuebingen samples 

argmin^ min ||i/(S'i) - i'iSj)\\l, (16) 
e t<3<N 

I 

over cffif G { 1 ,..., 10 }, and (Ti,(T 2 , and (J 3 G 
jo, 0.5,1,..., 5}, where the Sj ^ Ve, the Si are the 
Tubingen cause-effect pairs, and u is as in (15). This strat¬ 
egy can be thought of as transductive learning, since we 
assume to know the test inputs prior to the training of our 
inference rule. We set n = 1000, and N = 10, 000. 

Using the generative process outlined above, we construct 
the synthetic training data 

where {(iiy, Ps, and train our classifier on it. 

Figure 1 plots the classification accuracy of RCC, IGCI 
(Daniusis et ah, 2012), and ANM (Mooij et ah, 2014) versus 
the fraction of decissions that the algorithms are forced to 
make out of the 82 scalar Tuebingen cause-effect pairs. To 
compare these results to other lower-performing methods. 
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refer to Janzing et al. (2012). RCC surpasses the state-of-the- 
art with a classification accuracy of 81.61% when inferring 
the causal directions on all pairs. The confidence of RCC 
is computed using the classifier’s output class probabilities. 
SVMs obtain a test accuracy of 77.2% in this same task. 

5.2. Inferring the Arrow of Time 

We test the effectiveness of our method to infer the arrow of 
time from causal time series. More specifically, we assume 
access to a set of time series and our task is to 

infer, for each series, whether Xi Xi+i or Xi ^ Aj+i. 

We compare our framework to the state-of-the-art of Peters 
et al. (2009), using the same electroencephalography signals 
(Blankertz, 2005) as in their original experiment. On the one 
hand, Peters et al. (2009) construct two Auto-Regressive 
Moving-Average (ARMA) models for each causal time se¬ 
ries and time direction, and prefers the solution under which 
the model residuals are independent from the inferred cause. 
To this end, the method uses two parameters for which 
no estimation procedure is provided. On the other hand, 
our approach makes no assumptions whatsoever about the 
parametric model underlying the series, at the expense of 
requiring a disjoint set of TV = 10, 000 causal time series 
for training. Our method matches the best performance of 
Peters et al. (2009), with an accuracy of 82.66%. 

5.3. ChaLearn’s Challenge Data 

The cause-effect challenges organized by Guyon (2014) 
provided N = 16,199 training causal samples Si, each 
drawn from the distribution of Xi x Yi, and labeled either 

“X, ^ y,”, “X, ^ r/’, “x, y -, or “x, x y,”. 

The task of the competition was to develop a causation co¬ 
efficient which would predict large positive values to causal 
samples following “X^ Y”, large negative values to 

samples following “X^ ^ Y”, and zero otherwise. Using 
these data, our obtained a test bidirectional area under the 
curve score (Guyon, 2014) of 0.74 in one minute and a half, 
ranking third in the overall leaderboard. The winner of the 
competition obtained a score of 0.82 in thirty minutes, but 
resorted to several dozens of hand-crafted features. 

Partitioning these same data in different ways, we learned 
two related but different binary classification tasks. First, 
we trained our classifier to detect latent confounding, and 
obtained a test classification accuracy of 80% on the task of 
distinguishing “X y or X ^ X” from ‘"X ^ Z ^ Y”. 
Second, we trained our classifier to measure dependence, 
and obtained a test classification accuracy of 88% on the 
task of distinguishing between “X JL Y” and “else”. We 
consider these results to be a promising direction to learn 
flexible hypothesis tests and dependence measures directly 
from data. 


5.4. Reconstruction of Causal DAGs 

We apply the strategy described in Section 4 to reconstruct 
the causal DAGs of two multivariate datasets: autoMPG 
and abalone (Lichman, 2013). Once again, we resort to 
synthetic training data, generated in a similar procedure to 
the one used in Section 5.1. Refer to Section C for details. 

Regarding autoMPG, in Figure 2, 1) the release date of 
the vehicle (AGE) causes the miles per gallon consumption 
(MPG), acceleration capabilities (ACC) and horse-power 
(HP), 2) the weight of the vehicle (WEI) causes the horse¬ 
power and MPG, and that 3) other characteristics such as the 
engine displacement (DIS) and number of cylinders (CYL) 
cause the MPG. Eor abalone, in Eigure 3, 1) the age of the 
snail causes all the other variables, 2) the overall weight 
of the snail (WEI) is caused by the partial weights of its 
meat (WEA), viscera (WEB), and shell (WEC), and 3) the 
height of the snail (HEI) is responsible for other phisicaly 
attributes such as its diameter (DIA) and length (LEN). 

The target variable for each dataset is shaded in gray. Inter- 
stingly, our inference reveals that the autoMPG dataset is 
a causal prediction task (the features cause the target), and 
that the abalone dataset is an anticausal prediction task (the 
target causes the features). This distinction has implications 
when learning from these data (Scholkopf et al., 2012). 



Figure 2. Causal DAG recovered from data autoMPG. 



Figure 3. Causal DAG recovered from data abalone. 


6 . Future Work 

Three research directions are in progress. Eirst, to improve 
learning rates by using common assumptions from causal 
inference. Second, to further investigate methods to recon¬ 
struct multivariate DAGs. Third, to develop mechanisms to 
interpret the causal footprints learned by our classifiers. 
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A. Topological and Measurability Considerations 

Let {Z,Tz) and {C,tc) be two separable topological spaces, where Z is the input space and C := {—1,1} is the 
output space. Let B{t) be the Borel cr-algebra induced by the topology t. Let P be an unknown probability measure on 

{Zx C,B{tz)^B{tc)). 

Consider also the classifiers f G Tk and loss function £ to be measurable. 

A. l. Measurability Conditions to Learn from Distributions 

The first step towards the deployment of our learning setup is to guarantee the existence of a measure on the space pkiT’)x C, 
where PkiV) = {pk{P) '■ P G P} ^ T~Lk is the set of kernel mean embeddings associated with the measures in V. The 
following lemma provides such guarantee. This allows the analysis within the rest of this Section on Pk{P) x B- 

Lemma 2. Let (Z,tz) and [C,tc) be two separable topological spaces. Let V be the set of all Borel probability measures 
on {Z,B{tz)). Let Pk{P) = {Bk{P) '■ P G P} Q Ph where pk B the kernel mean embedding (1) associated to some 
bounded continuous kernel function k : Z x Z Then, there exists a measure on Pk{P) x C. 

Proof. The following is a similar result to Szabo et al. (2014, Proof 3). 

Start by endowing P with the weak topology r-p, such that the map 

L{P) = Jj{z)AP{z), (17) 

is continuous for all / S Cb{Z). This makes {P, B{tp)) a measurable space. 

First, we show that pk '■ {P, Birp)) —>■ {Pk, B{tp)) is Borel measurable. Note that TLk is separable due to the separability 
of {Z, Tz) and the continuity of k (Steinwart (& Christmann, 2008, Lemma 4.33). The separability of Pk implies pk is Borel 
measurable iff it is weakly measurable (Reed & Simon, 1972, Thm. IV.22). Note that the boundedness and the continuity of 
k imply TLk Q Cb{Z) (Steinwart & Christmann, 2008, Lemma 4.28). Therefore, (17) remains continuous for all / S TLk, 
which implies the Borel measurability of pk. 

Second, pk : {P, B{tp)) —>• {Q, B[Tg)) is Borel measnrable, since the B{Tg) = {Ai^Q : A G B{Pk)} Q B{t-h), where 
B{Tg ) is the cr-algebra induced by the topology of Q G B{Pk) (Szabo et al., 2014). 

Third, we show that g : {P x C,B{tp) ® B{tc)) —t{Gx C,B{Tg) 0 B{tc)) is measurable. For that, it suffices to 
decompose g{x, y) = {gi{x, y), g 2 {x, y)) and show that gi and 52 are measurable (Szabo et al., 2014). □ 

B. Proofs 

B.l. Theorem 1 

Note that the original statement of Theorem 27 in Song (2008) assumed / G [0,1] while we let elements of the ball in 
RKHS to take negative values as well which can be achieved by minor changes of the proof. For completeness we provide 
the modified proof here. Using the well known dual relation between the norm in RKHS and sup-norm of empirical process 
which can be found in Theorem 28 of Song (2008) we can write; 

llMk(P) - Tk(Ps)lln, = sup ( E [/( 2 ;)] --^/(z,)y (18) 

Now we proceed in the usual way. First we note that the sup-norm of empirical process appearing on the r.h.s. can be viewed 
as a real-valued function of i.i.d. random variables zi,... ,Zn. We will denote it as P(zi,..., Zn). The straightforward 
computations show that the function F satisfies the bounded difference condition (Theorem 14 of Song (2008)). Indeed, let 
us fix all the values zi,..., except for the zj which we will set to z'. Using identity \a — h\ = {a — b)la>b + {b— a)lo<& 
and noting that if sup^. f{x) = f{x*) then sup,^. f{x) — sup^. g{x) is upper bounded by f{x*) — g{x*) we get 

|F(zi,... ,z',... ,z„) - F(zi,.. .,Zj ,... ,z„)| 
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Now noting that \f{z) — f{z')\ G [0,2] we conclude with 

\F{zi ,..., z',..., z„) - F{zi, ...,Zj,...,Zn)\ 

2 2 _ 2 

Using McDiarmid’s inequality (Theorem 14 of (Song, 2008)) with Ci = 2/n we obtain that with probability not less than 
1 — 5 the following holds: 


sup ( E [/(z)] --^/(zi) ] < E 


sup ( E [/(z)] - f{zi) 


21og(l/5) 


n 


Finally, we proceed with the symmetrization step (Theorem 2.1 of (Koltchinskii, 2011)) which upper bounds the expected 
value of the sup-norm of empirical process with twice the Rademacher complexity of the class {/ S Hk ■ H/llHfe < 1} 
and with upper bound on this Rademacher complexity which can be found in Lemma 22 and related remarks of Bartlett & 
Mendelson (2002). 


We also note that the original statement of Theorem 27 in Song (2008) contains extra factor of 2 under logarithm compared 
to our modified result. This is explained by the fact that while we upper bounded the Rademacher complexity directly. Song 
(2008) instead upper bounds it in terms of the empirical (or conditional) Rademacher complexity which results in another 
application of McDiarmid’s inequality together with union bound. 


B.2. Theorem 3 


We will proceed as follows: 

RM'n) - RAH = RAfn) - RAfn) 

+ RAfn) - RAf) 

+ - i?^(r) 

<2 sup |i?^(/)-i?4/)| 
feFk 

= 2 sup |i?^(/) - RAf) + RAf) - RAf)\ 

feFk 

< 2 sup |i?^(/) - RAf)\ + 2 sup |i?^(/) - RAf)l (19) 

We will now upper bound two terms in (19) separately. 


We start with noticing that Theorem 2 can be used in order to upper bound the first term. All we need is to match the 
quantities appearing in our problem to the classical setting of learning theory, discussed in Section 3.1. Indeed, let p,(7^) play 
the role of input space Z. Thus the input objects are kernel mean embeddings of elements of V. According to Lemma 2, 
there is a distribution defined over /i(P) x C, which will play the role of unknown distribution P. Finally, i.i.d. pairs 
{ Ak{Pi),h) form the training sample. Thus, using Theorem 2 we get that with probability not less than 1 — 6/2 (w.r.t. 
the random training sample { AAPi), k) }"_i) the following holds true: 


sup \RAf) - Rvif)\ < 2LyE 

feFk 


1 

sup — 
feFk n- 


2=1 


/i^ 

V 2n 


(20) 


To deal with the second term in (19) we note that 


sup \R^{f) - Rvif)\ = sup 




1 J ^ 

< sup -Y\^i~^AAAPi))) 


- <p{-hfAk{Psi))) 

- ‘p{-hfAkiPs.)))\ 


1 \ 

< sup “X! “ fAk{PSi))\, 
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where we have used the Lipschitzness of the cost function ip. Using the Lipschitzness of the functionals / € we obtain: 

L " 

sup \R^{f) - RM)\ ^ sup — ^ \\lJ-k{Pi) - IJ-kiRsMuk- (21) 

Also note that the usual reasoning shows that if /i € Rk and ll^llw^ ^ 1 then: 

\h{z)\ = \{h,k{z,-))-HA < \\h\\-HA\k{z,-)\\-H^ = \\h\\-H^^k{z,z) < \/k{z,z) 

and hence ||/i||oo = sup^g^ \h{z)\ < 1 because our kernel is bounded. This allows us to use Theorem 1 to control every 
term in (21) and combine the resulting upper bounds in a union bound"^ over i = 1,..., n to show that for any fixed 
Pi,... ,Pn with probability not less than 1 — S/2 (w.r.t. the random samples {Si}"^]^) the following is trae: 


L L 

'V sup — V llMfc(-Pi) - /ife(-Ps.)||wfc < sup V 


^^ ^z^p[k{z,z)] 



( 22 ) 


The quantity 2n/S appears under the logarithm since for every i we have used Theorem 1 with S' = S/(2n). Combining 
(20) and (22) in a union bound together with (19) we finally get that with probability not less than 1 — (5 the following is true: 


RMn) - RM*) < ^L^RniP) + 2B 




where we have defined Ljr = supjgjr L/. 


B.3. Theorem 4 

Our proof is a simple combination of the duality equation (18) combined with the following lower bound on the supremum 
of empirical process presented in Theorem 2.3 of Bartlett & Mendelson (2006): 

Theorem 5. Let F be a class of real-valued functions defined on a set Z such that svvp j^p ||/||oo < 1- Let zi,..., Zn, z € Z 
be i.i.d. according to some probability measure P on Z. Set ap = supjg^ V[/( 2 :)]. Then there are universal constants 
c, c', and C for which the following holds: 


E 


sup 

f^F 


E [f{z)] 


1 

n 


i=l 



Furthermore, for every integer n>\ / o\, with probability at least c!, 


sup 

n 

E [f{z)] - - X 

> CE 

sup 

/e-P 

n ^^ 

2 = 1 


_/e-P 


E[/W] 


1 

n 


i=l 


We note that constants c, c', and C appearing in the last result do not depend on n, ap or any other quantities appearing in 
the statement. This can be verified by the inspection of the proof presented in (Bartlett & Mendelson, 2006). 

B.4. Lemma 1 

Proof Bochner’s theorem (Rudin, 1962) states that for any shift-invariant symmetric p.d. kernel k defined on Z x Z where 
Z = R'^ and any z,z' G Z the following holds: 

k{z,z')= j Pk{w)F^'"’^-^'^dw, (23) 

Note that the union bound results in the extra log n factor in our bound. We believe that this factor can be avoided using a refined 
proof technique, based on the application of McDiarmid’s inequality. This question is left for a future work. 
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where p^. is a positive and integrable Fourier transform of the kernel k. It is immediate to check that Fourier transform of 
such kernels k is always an even function, meaning pk{—w) = pk{w). Indeed, since k{z — z') = k{z’ — z) for all z,z' G Z 
(due to symmetry of the kernel) we have: 

Pk{w) '■= [ k{6)e^^'^'^^dd = ( k{S) cos{{w,S))dS = [ k{6) cos{—{w,6))dS = pk{—w) 

J z Jz Jz 

which holds for any w G Thus for any z, z' G we can write: 


k(z,z')= / pk{w){^cos{{w, z — z')) + i ■ sm{{w, z — z')))dw 
JR'i 

= / Pk{w){^cos{{w, z — z'))dw + i ■ / pk{w) sm{{w, z — z'))'^dw 

JRii JR'i 

= / pk{u)) cos{{w, z — z'))dw 
Js.'i 

= 2 / / -^pkiw) cos{{w, z) + b) cos{{w, z') + b) dbdw. 

jR<i Jo 27r 

Denote Ck = /j^d p{w)dw < oo. Next we will use identity cos(a — b) = ^ cos(a + x) cos{b + x)dx and introduce 
random variables b and w distributed according to if [0, 27r] and -^pk{w) respectively. Then we can rewrite 


k{z, z) = 2Ck E [cos((m, z) + b) cos{{w, z) + b)]. 

b.w 


(24) 


Now let Q be any probability distribution defined on Z. Then for any z,w G Z and b G [0, 27r] the function 

ffw.bO) ■= 2C'fe cos((m, z) + b) cos{{w, •) + b) 

belongs to the L 2 {Q) space. Namely, L 2 {Q) norm of such a function is finite. Moreover, it is bounded by 2Ck' 

\\9^,bi')\\L2iQ) = cos((m, z) + b) cos((m, t) + b)^ dQ{t) 

< iCi J dQ{t) = 'iCl- (25) 

Note that for any fixed x G Z and any random parameters w G Z and b G [0, 27r] the element j(-) is a random variable 
taking values in the L 2 {Q) space (which is Hilbert). Such Banach-space valued random variables are well studied objects 
(Ledoux & Talagrand, 1991) and a number of concentration results for them are known by now. We will use the following 
version of Hoeffding inequality which can be found in Lemma 4 of Rahimi & Recht (2008): 


Lemma 3. Let vi,..., Vm be i.i.d. random variables taking values in a ball of radius M centred around origin in a Hilbert 
space H. Then, for any <5 > 0, the following holds: 


1 

m 


-E 


1 

m 




< -(l + V21og(l/<5)). 
mV / 

H 


with probability higher than 1 — 5 over the random sample Vi,..., Vm- 


Note that Bochner’s formula (23) and particularly its simplified form (24) indicates that if w is distributed according to 
normalized Fourier transform -^Pk and b ~ (f ([0, 27r]) then E^, ^(-j] = k{z, •). Moreover, we can show that any 

element h of RKHS TLk also belongs to the L2{Q) space: 

\\hi-)\\UQ)= 

Ht,t)\\h\\n^dQ(t) < \\h\\^^ < oo, 


< 


(26) 
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where we have used the reproducing property of k in RKHS "Hfc, Cauchy-Schwartz inequality, and the fact that the kernel k 
is bounded. Thus we conclude that the function k{z, •) is also an element of L 2 {Q) space. 


This shows that if we have a sample of i.i.d. pairs {(tUi, ® m 9w t (') = ’) where 

are i.i.d. elements of Hilbert space L 2 {Q). We conclude the proof using concentration inequality for Hilbert spaces of 
Lemma 3 and a union bound over the elements z G S, since 


9UPS)--Y.9^r^i-) 


L2(Q) 


^ n 1 


i2(Q) 


1 ” 


2=1 


1 

n 


E 


k{zi, •) 


^ UL 

i-j 


L2(Q) 


where we have used the triangle inequality. 


□ 


B.5. Excess Risk Bound for Low-Dimensional Representations 

Let us first recall some important notations introduced in Section 3.3. For any w, z G Z and b G [0, 2tt\ we define the 
following functions 

9w,b{-) = 2C'fc cos((w, z) + h) cos{{w, ■} + b) G L 2 (Q), (27) 

where Ck = J 2 ; Pk{z)dz for p/j: Z —K being the Fourier transform of k. We sample m pairs {(wi, &i)}™ 1 i.i.d. from 
( ) X [0, 27r] and define the average function 


^ IIL 

9U-) = -Y.9l.,bS-)^L2{Q). 

‘ i=l 


Since cosine functions (27) do not necessarily belong to the RKHS T-Lk and we are going to use their linear combinations as 
a training points, our classifiers should now act on the whole L 2 {Q) space. To this end, we redefine the set of classifiers 
introduced in the Section 3.2 to be {sign of-.fG J^q} where now Tq is the set of functionals mapping L 2 {Q) to K. 

Recall that our goal is to find f* such that 


f* G arg min R^(f) := arg min E 




(28) 


As was pointed out in Section B.4 if the kernel k is bounded sup^g^ ^ 1 then P-k Q L 2 {Q). In particular, for any 

P G V it holds that fJ-k{P) G L 2 {Q) and thus (28) is well defined. 

Instead of solving (28) directly, we will again use the version of empirical risk minimization (ERM). However, this time 
we won’t use empirical mean embeddings {pk{Psi)}'i=i since, as was already discussed, those lead to the expensive 
computations involving the kernel matrix. Instead, we will pose the ERM problem in terms of the low-dimensional 
approximations based on cosines. Namely, we propose to use the following estimator /™: 


/r G arg min R, 


1 

■(/) :=aa'gmm 

2=1 



The following result puts together Theorem 3 and Lemma 1 to provide an excess risk bound for which accounts for all 
sources of the errors introduced in the learning pipeline: 

Theorem 6. Let Z = and Q be any probability distribution on Z. Consider the RKHS Rk associated with some 
bounded, continuous, shift-invariant kernel function k, such that sup^.^^ S 2 1- Consider a class Pq of functionals 
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mapping L 2 {Q) to M with Lipschitz constants uniformly bounded by Lq. Let (/?:!£—>■ K’*' be a L^p-Lipschitz function such 
that (j){z) > lz>o- Let (p(^—f{h)tj < B for every f G J-q, h G L 2 {Q), and I G C. Then for any i5 > 0 the following holds: 


RMn) - RMl < ^LpR^iRQ) + 

+ 2-^:^ X! 7^ + \/21og(3n • Hi/ 6 )^ 

with probability not less than 1 — (5 over all sources of randomness, which are {{Pi, ?z)}7=i> {>S'i}7=i> {(wi, 


Proof. We will proceed similarly to (19): 

RMn) - RMl = RMn) - Kifn) 

<2 sup \RM)-K(f)\ 

f&tFQ 

= 2 sup \RM) - RM) + KU) - RM) + RAf) - Mif)\ 

f&tFQ 

< 2 sup \RM) - RAf)\ + 2 sup \RM) - RAf)\ + 2 sup \RM) " K(f)\- (29) 

f&tpQ f&tpQ f&tpQ 


First two terms of (29) were upper bounded in Section B.2. Note that the upper bound of the second term (proved in 
Theorem 3) was based on the assumption that functionals in Fq are Lipschitz on Rk w.r.t. the Rk metric. But as we already 
noted, for bounded kernels we have Rk C L 2 {Q) which implies |lh||i 2 (Q) 7; ll^llHt for any h G Rk (see (26)). Thus 
\f{h) — f{h')\ < L f\\h — h'\\k 2 {Q) < Rf\\h — h'W-ki,. forany h, h' G Rk- It means that the assumptions of Theorem 3 hold 
true and we can safely apply it to upper bound the hrst two terms of (29). 

We are now going to upper bound the third one using Lemma 1 : 


sup \RAf) - RAU)\ = sup 

/6^q /e^Q 


{-f{MPs,))k) - 7 f ^ E Ami-) 


2 = 1 


2=1 


zGSi 


1 M 




i-fiMPs,))h) Ami-) h 


rii 


zCiSi 


< ^ V sup 
<^f] sup L 


f{MPs,))-f(-Y.Ami-) 


zeSi 


MPs.)--J2Ami-) 


z^Si 


FiQ) 


We can now use Lemma 1 combined in union bound over i = 1,..., n with 6' = 6/n. This will give us that 


sup \RM) - Mif)\ < ArM ^ AA ^21og(n-n,/5)] . 

^ TAi Am \ ) 


with probability not less than 1 — 5 over {(wi, 


□ 
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C. Training and Test Protocols for Section 5.4 

The synthesis of the training data for the experiments described in Section 5.4 follows a very similar procedure to the one 
from Section 5.1. The main difference here is that, when trying to infer the cause-effect relationship between two variables 
Xi and Xj belonging to a larger set of variables X = {Xi, ..., Xd), we will have to account for the effects of possible 
confounders Xk C X \ {Xi, Xj}. For the sake of simplicity, we will only consider one-dimensional confounding effects, 
that is, scalar X/.. 


C.l. Training Phase 

To generate cause-effect pairs exhibiting every possible scalar confounding effect, we will generate data from the eight 
possible directed acyclic graphs depicted in Figure 4. 


O O O 0-0-0 0-0-0 

0-0 O 0-0-0 o^^o 

0-0-0 o^^o 


Figure 4. The eight possible directed acyclic graphs on three variables. 

In particular, we will sample N different causal DAGs Gi,... ,Gn, where the Gi describes the causal structure underlying 
{Xi, Yi,Zi). Given Gi, we generate the sample set Si = {{xij,yij, according to the generative process described 

in Section 5.1. Together with Si, we annotate the triplet of labels {lii,li 2 , hs), where according to Gi, 

• li^ = -pi if “AT, ^ Yi’\ hi = -1 if “Xi ^ Yi", and la = 0 else. 

• k 2 = +1 if “Fi —>■ Zi”, li 2 = —1 if “Yi F- Zi”, and li 2 = 0 else. 

• Zi 3 = -pi if “X, ^ Zi”, la = -f if ^ Zi^\ and hi = 0 else. 

Then, we add the following six elements to our training set: 

{{{xij, iJij, Zij)}j^i, +/l), {{{uij, Zij, Xij)}j^i, +I 2 ), {{{xij, Zij, yij)}j—i, +^ 3 )) 

({ iVij 1 )} 1—1’ ^ 1 )’ ({ ’ V'i-j ’ )}l—1’ ^ 2 ); ({ {^ij t ^ij ■> Vij^} j —1 ’ 

for all 1 < i < X. Therefore, our training set will consist on 6X sample sets and their paired labels. At this point, and 
given any sample {{uij, Vij, Wij)}'j^i from the training set, we propose to use as feature vectors the concatenation of the 
TO—dimensional empirical kernel mean embeddings (14) of {uii}"=i, {vij}j'-i, and {{uij, Vij, Wij)}j'^i, respectively. 

C.l. Test Phase 

To start, given rite test d—dimensional samples S = {{xu,..., Xdi)}^Xi^ the hyper-parameters of the kernel and training 
data synthesis process are transductively chosen, as described in Section 5.1. 

In order to estimate the causal graph underlying the test sample set S, we compute three d x d matrices M_^, Mjj_, and 
M<_. Each of these matrices will contain, at their coordinates i, j, the probabilities of the labels “X^ —> Xj”, “Xi _LL Xj”, 
and “Xi F- Xj”, respectively, when averaged over all possible scalar confounders X^. Using these matrices, we estimate 
the underlying causal graph by selecting the type of each edge (forward, backward, or no edge) to be the one with maximal 
probability according. As a post-processing step, we prune the least-confident edges until the derived graph is acyclic. 



